DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods
The concept of a sampling error recognises that there is sample to sample—or experiment to experiment—variation whenever we summarise data using the “tools” described in T01
A convenient feature of random samples and random assignment is that we can characterise the distribution of a statistic, such as a sample mean, \(\bar{x}\), or a T-test statistic, \(t_0\), in a way that accounts for sampling error
The survey literature makes a distinction between sampling errors, which arise from the decision to take a sample rather than trying to survey the whole population (which is what a census tries to do)
— Wild & Seber (2000)
We know that Utah (US state) has five national parks, whose population mean area is 261.8 square miles (sq. miles).
What are the possible sample mean areas we could observe, if we took a simple random sample of two national parks?
The total area of the five parks is 1309 sq. miles. The ten possible sample means (one for each pair of parks) are: 87.5, 323, 248.5, 174, 291.5, 217, 142.5, 452.5, 378, and 303.5 sq. miles.
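As a sketch, the full set of possible size-two sample means can be enumerated in R. The specific park areas below (to the nearest square mile) are assumed values, chosen to be consistent with the ten sample means listed above:

```r
# Assumed park areas (sq. miles), consistent with the sample means above
areas <- c(56, 119, 229, 378, 527)

# Mean area for every possible simple random sample of size n = 2
sample.means <- combn(areas, 2, mean)
sample.means
# [1]  87.5 142.5 217.0 291.5 174.0 248.5 323.0 303.5 378.0 452.5

# The average of all possible sample means recovers the population mean
mean(sample.means)
# [1] 261.8
```

Averaging the ten possible sample means recovers the population mean of 261.8 sq. miles, illustrating that the sample mean is an unbiased estimator of \(\mu\).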
A sampling distribution is the distribution of sample statistics computed for different samples of the same size from the same population (or process).
A sampling distribution shows us how the sample statistic varies from sample to sample.
— Lock et al. (2021)
The distribution of the possible sample means when \(n = 10\), under simulation, is unimodal (one peak) and symmetrical
What about the shape?
We call this particular shape a bell-shaped distribution, and we can also describe the sampling distribution of the sample mean as approximately Normally distributed.
The phrase Normally distributed means that we could use the data to build a model to explain the chance of observing an interval of values
The mathematical details relevant for us in DATAX121 are that:
If the population mean, \(\mu\), and population standard deviation, \(\sigma\), are known—The ground “truths” (parameters) that summarise all possible values we could observe
The sampling distribution of the sample mean, \(\bar{x}\), is
\[ \bar{x} \overset{\text{approx.}}{\sim} \text{Normal} \! \left(\mu_{\bar{x}} = \mu, \; \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\right) \]
The use of the \(\bar{x}\) subscripts is to make it clear that we are talking about the sampling distribution of \(\bar{x}\) and not the possible values we could observe
For the population of wood blocks, \(\mu = 33.5\) and \(\sigma = 20.5\)
| Sample Size | Estimates (from simulations) | Theoretical (as we know the ground “truths”) |
|---|---|---|
| \(n=10\) | \(\bar{x} = 33.47\) and \(s = 6.21\) | \(\mu_{\bar{x}} = 33.50\) and \(\sigma_{\bar{x}} = 6.48\) |
| \(n=25\) | \(\bar{x} = 33.44\) and \(s = 3.53\) | \(\mu_{\bar{x}} = 33.50\) and \(\sigma_{\bar{x}} = 4.10\) |
| \(n=50\) | \(\bar{x} = 33.55\) and \(s = 2.03\) | \(\mu_{\bar{x}} = 33.50\) and \(\sigma_{\bar{x}} = 2.90\) |
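The theoretical column follows directly from \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\); a quick check in R:

```r
sigma <- 20.5                          # population standard deviation (given)
round(sigma / sqrt(c(10, 25, 50)), 2)  # sigma_xbar for n = 10, 25, 50
# [1] 6.48 4.10 2.90
```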
We sample or experiment because we do not know the ground “truths” (parameters). Hence the idea of population ⇝ sample ⇢ population
If we only have one sample, the (descriptive) statistics calculated from this sample are our best estimate of the ground “truths”
Coupled with our understanding of the sampling distributions of statistics when we take random samples1
We can quantify the uncertainty of the statistics calculated from the one sample!
The standard error1 of the sample mean, \(\bar{x}\), is
\[ \text{se}(\bar{x}) = \frac{s}{\sqrt{n}} \]
where \(s\) is the sample standard deviation and \(n\) is the sample size
Recall that the theoretical passage time for Newcomb’s experiment was 24.8296 millionths of a second.
The sample mean, \(\bar{x}\), is:
\(\bar{x} = \ldots = 24.83 ~ (2 ~ \text{dp})\)
The standard error, \(\text{se}(\bar{x})\), is:
\(\text{se}(\bar{x}) =\displaystyle \frac{s}{\sqrt{n}}\)
\(\phantom{\text{se}(\bar{x})} = \displaystyle \frac{0.0051}{\sqrt{20}}\)
\(\phantom{\text{se}(\bar{x})} = 0.0011 ~ (4 ~ \text{dp})\)
\(\text{se}(\bar{x})\) should look familiar—revisit Slide 12
The critical difference is that \(\text{se}(\bar{x})\) is quantified with a statistic, \(s\), whereas \(\sigma_{\bar{x}}\) is defined in terms of \(\sigma\), a ground “truth” about the spread of all possible values
Thankfully, we can handle this “minute”1 difference with a different tool, the Student’s t-distribution, another kind of bell-shaped distribution
In particular, the Student’s t-distribution allows us to quantify the precision of \(\bar{x}\), our best estimate of \(\mu\), for any sample size (with some assumptions)
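As an illustrative sketch, the 97.5% quantiles of the Student's t-distribution (the multipliers used for a 95% interval) approach the corresponding Normal value as the degrees of freedom grow:

```r
# t multipliers for a 95% interval at various degrees of freedom
round(qt(0.975, df = c(4, 9, 29, 99)), 3)
# [1] 2.776 2.262 2.045 1.984

# The corresponding Normal multiplier
round(qnorm(0.975), 3)
# [1] 1.96
```

The extra width from the larger t multiplier compensates for estimating \(\sigma\) with \(s\), especially in small samples.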
A confidence interval for a parameter is an interval computed from sample data by a method that will capture the parameter for a specified proportion of all samples.
The success rate (proportion of all samples whose intervals contain the parameter) is known as the confidence level.
— Lock et al. (2021)
In practice:
\[ \bar{x} \pm t^*_{1-\alpha/2}(\nu) \times \text{se}(\bar{x}) \]
where \(t^*_{1-\alpha/2}(\nu)\) is the \(1-\alpha/2\) quantile of the Student's t-distribution with \(\nu = n - 1\) degrees of freedom, and \(\text{se}(\bar{x}) = s/\sqrt{n}\) is the standard error
blocks.df <- read.csv("datasets/random-blocks.csv")
t.test(Weight ~ 1, data = blocks.df, conf.level = 0.95)
One Sample t-test
data: Weight
t = 5.1279, df = 9, p-value = 0.0006213
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
16.02792 41.33208
sample estimates:
mean of x
28.68
Note
I will provide the relevant R output if you are asked to calculate any confidence interval for a test or exam
Sample data from a similar woodblock exercise used in the first lecture. The exercise aimed to estimate the average block weight using only a sample of blocks.
| Variable | Description |
|---|---|
| Block.ID | An integer between 1–100 denoting the block’s identification number |
| Weight | A number denoting the weight of the block (grams) |
\[ \bar{x} \pm t^*_{1-\alpha/2}(\nu) \times \text{se}(\bar{x}) \]

With \(\bar{x} = 28.68\), \(s = 17.68639\), \(n = 10\) (so \(\sqrt{n} = 3.162278\)), and \(t^*_{0.975}(9) = 2.262157\), the standard error is \(\text{se}(\bar{x}) = 17.68639 / 3.162278 = 5.592928\), giving

\[ 28.68 \pm 2.262157 \times 5.592928 = (16.02792, ~ 41.33208) \]
Style One
We are 95% sure that the mean block weight for the 100 blocks is somewhere between 16.03 and 41.33 grams.
Style Two
With 95% confidence, we estimate that the mean block weight of the 100 blocks is somewhere between 16.03 and 41.33 grams.
Critical features
| Sample Size | % Coverage (from simulations) |
|---|---|
| \(n=10\) | \(93.00\%\) |
| \(n=25\) | \(95.95\%\) |
| \(n=50\) | \(98.66\%\) |
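A coverage percentage like those above can be estimated by simulation. The sketch below assumes a Normal population with the wood-block parameters (the lecture's actual population of 100 blocks may differ, so the resulting percentage need not match the table):

```r
set.seed(1)                 # assumed seed, for reproducibility
mu <- 33.5                  # ground "truth" mean for the wood blocks
sigma <- 20.5               # ground "truth" standard deviation
n <- 10                     # sample size
n.sims <- 10000             # number of simulated samples

covered <- replicate(n.sims, {
  x <- rnorm(n, mean = mu, sd = sigma)            # draw one random sample
  t.mult <- qt(0.975, df = n - 1)                 # 95% multiplier
  ci <- mean(x) + c(-1, 1) * t.mult * sd(x) / sqrt(n)
  ci[1] <= mu & mu <= ci[2]                       # did the interval capture mu?
})

mean(covered) * 100         # % coverage; close to 95 for Normal data
```

For a Normal population the theoretical coverage is 95% at every sample size; departures from this, as in the table, reflect the particular population being sampled.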
Simon Newcomb1 experimented with a new method of measuring the speed of light in 1882, which involved using two different mirrors placed approximately 3721.865 metres apart. The following data comes from 20 repeated measurements of the passage time for light to travel from one mirror to another and back again.
The theoretical passage time for the above distance was 24.8296 millionths of a second. If this new method is unbiased and precise, the experimental data should agree with the theoretical passage time.
| Variable | Description |
|---|---|
| pass.time | A number denoting the passage time for light to travel from one mirror to another and back again (millionths of a second, μs) |
# Calculate and assign several statistics to their own objects
xbar <- mean(lightspeed.df$pass.time)
n <- nrow(lightspeed.df)
se <- sd(lightspeed.df$pass.time) / sqrt(n)
t.mult <- qt(1 - 0.05 / 2, df = n - 1)
c(xbar, n, se, t.mult)
[1] 24.828550000 20.000000000 0.001145874 2.093024054
The lower and upper limits of the 95% confidence interval:
[1] 24.82615
[1] 24.83095
Recall that the theoretical passage time for Newcomb’s experiment was 24.8296 millionths of a second.
The Avogadro constant is a fundamental quantity in chemistry specifying the number of molecules in one mole of a substance. It can only be determined by experiment, for example, using an electrochemical cell. The accepted value is \(6.022 \times 10^{23}\).
A chemist conducts five repeats of an experiment to estimate Avogadro’s constant and obtains the following output
\[ \bar{x} = 5.78 \times 10^{23}, \quad s = 0.20 \times 10^{23}, \quad t^\ast_{0.975}(4) = 2.78, \quad t^\ast_{0.975}(5) = 2.57 \] Construct and interpret a 95% confidence interval for \(\mu\). You may assume the assumptions to construct such an interval have been met.
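One possible worked sketch of the calculation (with \(\nu = n - 1 = 4\), so the correct multiplier is \(t^*_{0.975}(4) = 2.78\); check the arithmetic against your own working):

\[ \bar{x} \pm t^*_{0.975}(4) \times \frac{s}{\sqrt{n}} = 5.78 \times 10^{23} \pm 2.78 \times \frac{0.20 \times 10^{23}}{\sqrt{5}} = (5.53, ~ 6.03) \times 10^{23} \]

With 95% confidence, we estimate the true value of Avogadro's constant, as measured by this apparatus, to be somewhere between \(5.53 \times 10^{23}\) and \(6.03 \times 10^{23}\). Note that the accepted value of \(6.022 \times 10^{23}\) lies (just) inside this interval.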